What is Yandex Query?

Yandex Query is
• Serverless (available in Yandex Cloud)
• Federated (mix of different data providers)
• Query System (SQL-like)

Stream and batch processing with uniform syntax for easy development and debugging

For Data Engineers and Analysts

Uniform syntax increases productivity

[Figure: Yandex Query shown with YDB, Amazon Athena and Azure Analysis Services]

What is Yandex Query?

Big data
• Data is stored in the cloud
• Parallel processing
• Integrated with other services

SELECT message, CAST(microseconds AS UINT64) AS microseconds
FROM bindings.logs
WHERE microseconds > CurrentUtcTimestamp() - Interval("PT24H")
  AND priority = "ERROR"
ORDER BY microseconds DESC;


Completed. Run 10 December 2022, 23:25:50, 00:20

Result (55)

# | message | microseconds
1 | Create rate limiter resource \"b1g918jf99v96n4707u6\" error: BAD_REQUEST { <main>: Fatal: Resource already exists and has different settings. } | 1670628773594231
2 | Create rate limiter resource \"b1g918jf99v96n4707u6\" error: BAD_REQUEST { <main>: Fatal: Resource already exists and has different settings. } | 1670628771107286
3 | Create rate limiter resource \"b1g918jf99v96n4707u6\" error: BAD_REQUEST { <main>: Fatal: Resource already exists and has different settings. } | 1670628725682844

Dogfooding to improve the service quality
Big data processing

Logs are stored in Object Storage
• Large number of files (objects)
• Grouped by a key prefix (virtual folders)
• Records vary in size and per-prefix distribution

Object Storage / ... / yq-streams-logs / hot / year=2022 / month=12 / day=08 / hour=21

Name                               Size      Storage class  Last change
yq-logs_0-6260764_6260816.raw.gz   14.98 KB  Standard       08.12.2022, at 21:01
yq-logs_0-6260817_6260894.raw.gz   28.17 KB  Standard       08.12.2022, at 21:03
yq-logs_0-6260895_6260976.raw.gz   24.47 KB  Standard       08.12.2022, at 21:05
yq-logs_0-6260977_6261048.raw.gz   11.94 KB  Standard       08.12.2022, at 21:07
yq-logs_0-6261049_6261121.raw.gz   17.9 KB   Standard       08.12.2022, at 21:09
yq-logs_0-6261122_6261197.raw.gz   15.93 KB  Standard       08.12.2022, at 21:11

Fast processing requires parallel execution on multiple nodes


Big data processing

Naive solution
• List objects and apply filters (prune)
• Send subsets of keys to workers (map)
• Download and process data (reduce)

Listing is the bottleneck - can we do it better?
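The naive plan can be sketched in a few lines of Python. This is only an illustration: `STORE`, `list_all_keys`, `process` and `naive_count_errors` are hypothetical names, and an in-memory dictionary stands in for Object Storage. The point is that one sequential listing of every key has to finish before any worker gets work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an object store: key -> records.
STORE = {
    "hot/year=2022/month=12/day=08/hour=20/a.raw.gz": ["INFO 0"],
    "hot/year=2022/month=12/day=08/hour=21/a.raw.gz": ["ERROR 1", "INFO 2"],
    "hot/year=2022/month=12/day=08/hour=21/b.raw.gz": ["ERROR 3"],
}

def list_all_keys():
    """One sequential listing of every key: the bottleneck step."""
    return sorted(STORE)

def process(keys):
    """Worker: read the objects for its keys and count ERROR records (reduce)."""
    return sum(rec.startswith("ERROR") for key in keys for rec in STORE[key])

def naive_count_errors(wanted, workers=2):
    keys = [k for k in list_all_keys() if wanted in k]     # prune
    chunks = [keys[i::workers] for i in range(workers)]    # map: key subsets
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process, chunks))
```

Calling `naive_count_errors("hour=21")` counts the ERROR records in the two objects under that hour.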


Big data processing

Better solution
• List key prefixes and apply filters (prune)
• Send subsets of prefixes to workers (map)
• List keys by prefixes (expand)
• Download and process data (reduce)

Listing is also parallelized, but can we do it even better?
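A minimal sketch of the prefix-based plan, under the same kind of assumptions: the nested dictionary `STORE` and the function names are hypothetical, not the service's API. Only the short list of prefixes is listed centrally; each worker lists the keys under its own prefixes, so the expensive listing runs in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: prefix ("virtual folder") -> {key: records}.
STORE = {
    "hour=20/": {"a.raw.gz": ["ERROR 0"]},
    "hour=21/": {"a.raw.gz": ["ERROR 1", "INFO 2"], "b.raw.gz": ["ERROR 3"]},
    "hour=22/": {"a.raw.gz": ["INFO 4"]},
}

def list_prefixes():
    """Cheap central listing: prefixes only, not every object."""
    return sorted(STORE)

def expand_and_process(prefixes):
    """Worker: list keys under its prefixes (expand), then read them (reduce)."""
    errors = 0
    for prefix in prefixes:
        for key in sorted(STORE[prefix]):                  # expand
            errors += sum(r.startswith("ERROR") for r in STORE[prefix][key])
    return errors

def count_errors(hours, workers=2):
    prefixes = [p for p in list_prefixes() if p in hours]      # prune
    chunks = [prefixes[i::workers] for i in range(workers)]    # map: prefix subsets
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(expand_and_process, chunks))
```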


Big data processing

Extreme optimization (even load)
• Shuffle keys with back pressure
• Split single-item processing (if format supports)

In an ideal system all nodes start and finish simultaneously
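One way to picture "shuffle keys with back pressure" is a bounded queue: keys are not pre-assigned, a free worker takes the next key, and the producer blocks while the queue is full. The sketch below uses Python threads and a hypothetical `run_shuffled` helper. It illustrates the idea only and says nothing about how the service implements it.

```python
import queue
import threading

def run_shuffled(keys, handle, workers=3, max_pending=2):
    """Feed keys through a bounded queue; idle workers pull the next key."""
    pending = queue.Queue(maxsize=max_pending)  # bounded: put() blocks when full
    done = {i: [] for i in range(workers)}      # each worker appends to its own list

    def worker(idx):
        while True:
            key = pending.get()
            if key is None:                     # sentinel: no more work
                return
            handle(key)
            done[idx].append(key)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for key in keys:
        pending.put(key)                        # back pressure on the producer
    for _ in threads:
        pending.put(None)                       # one sentinel per worker
    for t in threads:
        t.join()
    return done
```

Because a worker only takes a new key when it has finished the previous one, a slow object delays one worker instead of a whole pre-assigned chunk.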


Multitenant Architecture

Multiuser/Multitenant & Managed    On Premise
• Shared environment               • Isolated environment
• Elastic                          • Fixed size cluster
• Pay as you go                    • Constant costs of ownership

Multitenant architecture is a common choice for cloud services


Multitenant Architecture

Advantages                Disadvantages
• CPU utilization         • Lower level of isolation
• Low support costs       • Higher failure probability
• Scalability             • Security issues

Multitenant architecture is flexible and complex in design


Capacity and isolation

Compute plane is isolated from the control plane

[Diagram: Load Balancer (Public API) and Control Plane nodes; Load Balancer (Private API) and Compute Plane nodes]

Larger computing power, lower degradation in case of a node failure


Capacity and isolation

Multiple tenants to reduce blast radius

[Diagram: Load Balancer (Public API), Load Balancer (Private API)]

Split the compute plane into isolated parts


Capacity and isolation

"Meta-rolling" update (blue-green deployment)

[Diagram: Load Balancer (Public API), Load Balancer (Private API), Compute Plane]

Instant redirect for zero downtime with a small overhead (less than 2x)


CPU usage limitation

Query is processed on multiple nodes
• Total CPU (per request) is limited
• Average load over time is measured
• Performance is not affected

Usage of the shared resources in a multiuser system must be limited


CPU usage limitation

Leaky bucket
• Each request "fills" the bucket
• Requests vary in size and are not aligned in time
• Bucket "leaks" at a fixed rate
• If bucket is full, incoming requests are delayed

Classic rate-limiting algorithm
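A minimal leaky-bucket sketch in Python, assuming an explicit clock passed in by the caller so the behaviour is deterministic. The class name, the `delay_for` method and the units are illustrative choices, not taken from the service.

```python
class LeakyBucket:
    """Leaky bucket that reports how long a request must be delayed."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity      # how much the bucket holds
        self.leak_rate = leak_rate    # units drained per second
        self.level = 0.0
        self.updated = 0.0            # time of the last update

    def _leak(self, now):
        drained = (now - self.updated) * self.leak_rate
        self.level = max(0.0, self.level - drained)
        self.updated = now

    def delay_for(self, size, now):
        """Add a request of `size` at time `now`; return its delay in seconds."""
        self._leak(now)
        overflow = self.level + size - self.capacity
        self.level += size            # a delayed request still occupies the bucket
        return max(0.0, overflow / self.leak_rate)
```

With `capacity=10` and `leak_rate=2`, a request of size 4 that arrives when the bucket is already full is delayed by 2 seconds, the time the bucket needs to leak those 4 units.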


CPU usage limitation

Distributed version
• Limiter is a fault tolerant external service
• Each worker requests the quota in advance
• Local overbudget CPU usage is allowed

[Diagram: Limiter as a service]

Limiting should not affect performance
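The distributed variant could look roughly like the sketch below. Everything here is an assumption made for illustration: `Limiter` is an in-process stand-in for the external service, `acquire` is a hypothetical call, and the worker only tracks its balance. What a worker does while in debt, such as pausing, is left out.

```python
class Limiter:
    """In-process stand-in for the external limiter service (hypothetical API)."""

    def __init__(self, budget):
        self.budget = budget

    def acquire(self, amount):
        granted = min(amount, self.budget)
        self.budget -= granted
        return granted


class Worker:
    def __init__(self, limiter, prefetch):
        self.limiter = limiter
        self.prefetch = prefetch   # quota the worker tries to hold ahead of need
        self.quota = 0.0           # may go negative: local overbudget usage

    def spend(self, cpu_seconds):
        """Account for CPU that was already used; the computation is never blocked."""
        self.quota -= cpu_seconds
        if self.quota < self.prefetch:
            # Top up in advance; any local debt is repaid by the same request.
            self.quota += self.limiter.acquire(self.prefetch - self.quota)
        return self.quota
```

Requesting quota ahead of time keeps the limiter call off the hot path, which is one way to read the slide's claim that limiting should not affect performance.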


Performance and reliability

Resource (CPU, memory, dataflow) usage control
• Adjustable quotas
• Hard limits

To mitigate
• Misbehaved queries
• Large bills
• Intentional DDoS attacks

A lot of computing power is not always desirable


Performance and reliability

Higher number of node failures as a result of
• Large number of nodes
• Mix of queries from different users

System must provide expected behavior
• Failovers and retries
  At least once
• What about exactly once?
  In great demand, but causes extra overhead

A large number of nodes increases the probability of a (single-node) failure


Performance and reliability

Users can get exactly-once processing "under certain conditions"
• At least once with deduplication
  Like UPSERT, requires determinacy
• Tagged data with non-transactional data providers
  With background GC
• 2PC if a data provider supports transactions

Exactly-once processing may affect performance
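The first option, at-least-once delivery with deduplication, can be illustrated with an upsert keyed by a deterministic record id. The names below (`UpsertSink`, `deliver_at_least_once`, the `fail_after` switch) are hypothetical, and a dictionary plays the role of the target table.

```python
class UpsertSink:
    """Sink keyed by a deterministic record id, so redelivery is harmless."""

    def __init__(self):
        self.rows = {}

    def upsert(self, record_id, value):
        # Writing the same id with the same value again changes nothing.
        self.rows[record_id] = value


def deliver_at_least_once(records, sink, fail_after=None):
    """Write records; optionally stop midway to simulate a worker failure."""
    for n, (record_id, value) in enumerate(records):
        if fail_after is not None and n == fail_after:
            return False
        sink.upsert(record_id, value)
    return True
```

A retry after a failure rewrites the records that were already delivered, and the result is the same as a single clean run. This only holds if the retry produces the same ids and values, which is the "requires determinacy" condition.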


Performance and reliability

Stream processing is checkpointed
• System periodically saves the state of the request (checkpoint)
• After a failure, processing restarts from the last checkpoint
• Checkpoints are stored in the transactional DB

Checkpointing does not delay the data processing


Lightweight checkpointing

How it works
• Workers make up a DAG, data flows from roots to leaves
• Checkpoints (barriers) are injected in root workers
• Checkpoints keep their position in an ordered data sequence
• Worker delivers a checkpoint to every outgoing edge
• Worker waits for checkpoints from all incoming edges

Inspired by Flink implementation (asynchronous barrier snapshotting)
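The barrier rules can be sketched for a single worker with several incoming edges. This is an illustration under assumptions: `Barrier` and `SumWorker` are hypothetical names, only one checkpoint is in flight at a time, and holding back items from an edge whose barrier already arrived follows the aligned-barrier idea rather than any confirmed detail of the service.

```python
class Barrier:
    def __init__(self, checkpoint_id):
        self.checkpoint_id = checkpoint_id


class SumWorker:
    """Worker with several incoming edges; its state is a running sum."""

    def __init__(self, inputs):
        self.inputs = set(inputs)
        self.total = 0
        self.waiting = set()    # edges whose barrier has already arrived
        self.held = []          # items from those edges, held until alignment
        self.snapshots = {}     # checkpoint_id -> state at the barrier
        self.out = []           # what is sent to every outgoing edge

    def receive(self, edge, item):
        if isinstance(item, Barrier):
            self.waiting.add(edge)
            if self.waiting == self.inputs:          # barriers from all edges
                self.snapshots[item.checkpoint_id] = self.total
                self.out.append(item)                # forward the barrier
                held, self.held, self.waiting = self.held, [], set()
                for value in held:
                    self._apply(value)
        elif edge in self.waiting:
            self.held.append(item)                   # belongs after the barrier
        else:
            self._apply(item)

    def _apply(self, value):
        self.total += value
        self.out.append(self.total)
```

The snapshot contains exactly the items that arrived before the barrier on every edge, so the barrier keeps its position in the ordered output.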


Lightweight checkpointing

Saving the state
• Saved data describes the state at the time of the checkpoint
• Persisting is asynchronous (don't wait for completion)
• Root worker (source) saves an ingress stream position
• Leaf worker (sink) saves egress stream info

Data processing does not wait for checkpoint persistence


Lightweight checkpointing

Restart after the failure
• Worker reports to the checkpoint coordinator on successful persistence
• Coordinator marks the checkpoint as valid and completed
• Checkpointing failure does not fail request execution
• After the failure the last valid checkpoint is used

Failure of a single checkpoint doubles RPO
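A small sketch of the coordinator's bookkeeping, with hypothetical names (`CheckpointCoordinator`, `report_persisted`, `restore_point`) and integer checkpoint ids assumed to grow over time. A checkpoint counts as valid only once every worker has reported it.

```python
class CheckpointCoordinator:
    """Tracks which workers have persisted each checkpoint."""

    def __init__(self, workers):
        self.workers = set(workers)
        self.acks = {}            # checkpoint_id -> workers that reported success
        self.last_valid = None

    def report_persisted(self, checkpoint_id, worker):
        acked = self.acks.setdefault(checkpoint_id, set())
        acked.add(worker)
        if acked == self.workers:
            # Every worker persisted its part: valid and completed.
            if self.last_valid is None or checkpoint_id > self.last_valid:
                self.last_valid = checkpoint_id

    def restore_point(self):
        """Checkpoint to restart from; incomplete checkpoints are ignored."""
        return self.last_valid
```

If checkpoints are taken every T seconds, a restart loses at most about T of progress. When one checkpoint never completes, the restart falls back to the one before it, so the window grows to about 2T. That is one way to read "doubles RPO".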


Security and UX

Support standard cloud practices to prevent unauthorized access
• Single cloud-wide authentication
  IAM
• Use of service accounts to access user resources
• Time-limited tokens

Compute plane is not trusted
• Sensitive data is signed in the control plane

Critical data is protected with a signature before passing to the compute plane
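The slides do not say which signature scheme is used. As a generic illustration only, the sketch below uses HMAC-SHA256 from the Python standard library. The secret, the payload format and the function names are assumptions. The key stays with the trusted side, so data that passed through the compute plane can be checked for tampering.

```python
import hashlib
import hmac

def sign(secret: bytes, payload: bytes) -> str:
    """Trusted side: compute a signature over the sensitive payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    """Trusted side: check that the payload was not altered in transit."""
    return hmac.compare_digest(sign(secret, payload), signature)
```

Without the secret, the compute plane can carry the payload and its signature but cannot produce a valid signature for a modified payload.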


Security and UX

Use service accounts (SA)
• To communicate between the system parts
• Ingress/egress data providers
• Cloud services access

Secure presets for UX
• Connections
  To hide data provider access
• Bindings (not about security, still about UX)
  To hide data format

Secure and comfortable for interactive access


Leave your feedback!

You can rate the talk and give feedback on what you liked or what could be improved